Abstract. Phylogenetic trees describe the evolutionary history of a group 
of present-day species from a common ancestor. These trees are typically 
reconstructed from aligned DNA sequence data. In this paper we analytically 
address the following question: is the amount of sequence data required to 
accurately reconstruct a tree significantly more than the amount required to 
test whether or not a candidate tree was the 'true' tree? By 'significantly', we 
mean that the two quantities behave the same way as a function of the number 
of species being considered. We prove that, for a certain type of model, the 
amount of information required is not significantly different; while for another 
type of model, the information required to test a tree is independent of the 
number of leaves, while that required to reconstruct it grows with this number. 
Our results combine probabilistic and combinatorial arguments. 

1. Introduction 

Phylogenetic trees are widely used in evolutionary biology to describe how species 
have evolved from a shared ancestral species. In the last 25 years, aligned DNA 
sequence data and related sequences (amino acids, codons etc) have been widely 
used for reconstructing and analysing these trees [H [Tl] . Tree reconstruction meth- 
ods usually assume that sequence sites evolve according to some Markov process. 
The question of how much data is required to reconstruct a phylogenetic tree has 
been considered by a number of biologists [21 [6l [TOl [16] and is topical, as it is not 
clear whether all trees for all taxa sets could be reconstructed accurately from the 
available data. 

In earlier papers, we have analytically quantified the sequence length required 
for accurate tree reconstruction when sites evolve i.i.d. under various Markov pro- 
cesses [21 [H [3 [1] . These bounds typically depend on the number of taxa and the 
properties of the tree - in particular, how close the probability of a change of state 
('substitution') on any edge is to 0 or to its maximum possible value. It is the 
rate of growth in the sequence length with the number of taxa that is of interest here. 
The growth rate in sequence length required for accurate tree reconstruction has 
a trivial lower bound growth of log(n) - this comes from simply comparing the 
number of binary trees on n leaves with the number of collections of n sequences 
of given length. What is perhaps surprising is that for certain finite-state Markov 
processes this primitive rate of growth can be achieved for some models [3], given 
a bound on the substitution probabilities. This log(n) upper bound on sequence 
length also applies to a discrete infinite-state model (the 'random cluster model') 
[7], given similar bounds on the substitution probabilities. The log(n) behaviour for 
these two models changes to a polynomial dependence on n when the probabilities 
of state change are allowed to exceed a certain critical value. 
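
The counting argument behind the trivial log(n) lower bound can be made concrete with a short sketch. It assumes the standard count (2n-5)!! of unrooted binary phylogenetic trees on n labelled leaves and compares it with the number s^(nk) of collections of n sequences of length k over s states. The function names and the choice of s are illustrative assumptions, not part of the original analysis.

```python
import math


def num_binary_trees(n: int) -> int:
    """Number of unrooted binary phylogenetic trees on n >= 3 labelled leaves, (2n-5)!!."""
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


def counting_lower_bound(n: int, num_states: int) -> int:
    """Smallest k with num_states**(n*k) >= number of trees (illustrative helper).

    If there are fewer data sets than trees, no method can output every tree,
    so any accurate reconstruction needs at least this many sites.
    """
    trees = num_binary_trees(n)
    return max(1, math.ceil(math.log(trees) / (n * math.log(num_states))))


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, counting_lower_bound(n, num_states=2))
```

The printed bound grows slowly with n, in line with the log(n) rate described above; it is only a necessary condition and says nothing about whether that rate can be achieved.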

In this paper, we address a quite different question: namely if one has both 
the data and a proposal for a 'true' tree (i.e. the tree that produced the data 
under the model) we would like to test whether this tree is indeed the true tree, or 
whether some other tree must have produced the data. The answer provided must 
be correct with high probability. This concept of testing fits into the Popperian 
tradition - we would like to be able to refute the hypothesis that a particular tree 
produced the data, without necessarily exhibiting the tree that did. In statistics, 
the theory of testing among a discrete set of hypotheses has a long history (see, for 
example, Wald's paper from 1948 [IJ). In contemporary phylogenetics, the concept 
of testing a tree is timely: as various 'tree of life' projects begin to provide detailed, 
large candidate evolutionary trees, the question of using data to 'test' any such 
candidate tree arises. 

In this paper, we ask whether the information (sequence length) required for 
these two tasks - reconstructing versus testing - is fundamentally the same, i.e. 
that it grows at the same rate as a function of the size of the taxon set. Intuitively, 
testing should be 'easier' (require less data) than reconstructing, since for testing, 
one has additional information, and one is simply asked to make a binary decision. 
This suggested difference is somewhat analogous to the "P ≠ NP" conjecture 
in computational complexity, where any candidate solution for a problem in the 
class NP can be readily verified or refuted (in polynomial time) but it is believed 
that finding such a solution is fundamentally harder (i.e. not always possible in 
polynomial time). Of course, in our setting, we are dealing with sequence length, 
not computing time, but the two problems have an analogous flavour. 

In the next section, we describe a general framework for discussing these issues, 
and we exhibit an (abstract) example where the sequence length required to test 
a discrete parameter is far less than the sequence length required to reconstruct 
it. Turning to the phylogenetic setting in Section 3, we show that when sequence 
sites evolve i.i.d. according to a finite-state Markov process then testing requires 
sequence length growth of rate log(n) - which is the same as reconstructing requires, 
given bounds on the substitution probabilities. By contrast, for a discrete infinite- 
state Markov process, the situation is quite different - constant length sequences 
suffice to test, but order log(n) length sequences are required for reconstruction. 
We conclude the paper with some brief comments. 



2. Testing versus reconstructing 

In this section, we describe definitions and properties of testing and reconstruct- 
ing in a general setting; we will specialize our approach to the phylogenetic setting 
in the following section. 

Suppose $A = A_n$ and $U = U_n$ are finite sets, and that we have a random variable 
$X = X_{(a,\theta)}$ taking values in $U$ and whose distribution depends on the discrete 
parameter $a \in A$, and perhaps some nuisance parameter $\theta$ taking values in a set 
$\Theta(a)$ (in the next section, $A$ will be a set of trees, $U$ will be a set of site patterns 
and the nuisance parameters will be edge lengths of the tree - all these concepts 
will be described later). We call $X_{(a,\theta)}$ a parameterized random variable, and when 
the nuisance parameter is either absent or has been specified for each $a$ (so that 
the distribution of $X_{(a,\theta)}$ just depends on $a$), we refer to a simply-parameterized 
random variable. 

Given a sequence $u = (u_1, u_2, \ldots, u_k)$ of $k$ i.i.d. observations of $X_{(a,\theta)}$, we would 
like to use $u$ to identify the discrete parameter $a$ correctly with high probability. 
This reconstruction task is always possible for sufficiently large values of $k$ provided 
a weak 'identifiability' condition holds, namely that for all $a \in A$ and $\theta \in \Theta(a)$ we 
have: 

$$\inf_{a' \neq a,\ \theta' \in \Theta(a')} d((a,\theta),(a',\theta')) > 0$$

(see [2]), where $d((a,\theta),(a',\theta'))$ denotes the $l_1$ distance between the probability 
distributions of the random variables $X_{(a,\theta)}$ and $X_{(a',\theta')}$. 

The two tasks that we consider can be summarized, informally, as follows. 

Reconstructing: Given $u \in U^k$, determine with high probability the 
value $a \in A$ that generated $u$ (for some $\theta \in \Theta(a)$). 






MIKE STEEL, LASZLO SZEKELY AND ELCHANAN MOSSEL 



Testing: Given a candidate value $a \in A$, as well as $u \in U^k$, determine 
with high probability whether or not $u$ was generated by $(a, \theta)$ (for some 
$\theta \in \Theta(a)$). 

We are interested in determining and comparing the number of i.i.d. samples 
required to carry out these tasks. Clearly testing is 'easier' than reconstructing 
(i.e. testing requires a smaller or equal value of k than reconstructing for the same 
accuracy), since one can always test by reconstructing and comparing the recon- 
structed object with the candidate object. Thus we will be particularly interested 
in determining whether the value of k grows at the same rate with n for these two 
tasks. 

In general a basic lower limit on k is required for reconstructing, and a much 
weaker one for testing, which shows that testing can require asymptotically much 
shorter sequences. Before describing these bounds (Proposition 2.1), we formalize 
the concept of reconstructing and testing. 

2.1. Definitions: Reconstruction, testing and accuracy. Throughout we will 
let $(X^1_{(a,\theta)}, \ldots, X^k_{(a,\theta)})$ denote a sequence of $k$ i.i.d. observations generated by $(a, \theta)$. 

A reconstruction method $R$ is a random variable¹ $R(u)$ taking values in $A$ and 
which depends on $u \in U^k$. Then: 

$$\rho^k_{(a,\theta)} := P(R(X^1_{(a,\theta)}, \ldots, X^k_{(a,\theta)}) = a)$$

is the probability that $R$ will correctly reconstruct $a$ from $k$ i.i.d. samples generated 
by $(a, \theta)$. We say that $R$ has (reconstruction) accuracy $1 - \epsilon$ (for $k$ samples) if, for 
all $a \in A$ and $\theta \in \Theta(a)$, we have 

$$\rho^k_{(a,\theta)} \ge 1 - \epsilon.$$

Note that, if we let $p_{a,\theta}(u) = P(X_{(a,\theta)} = u)$, and for $u = (u_1, \ldots, u_k) \in U^k$ let² 

$$p_{a,\theta}(u) := \prod_{i=1}^{k} p_{a,\theta}(u_i),$$

then: 

$$\rho^k_{(a,\theta)} = \sum_{u \in U^k} p_{a,\theta}(u) \cdot P(R(u) = a).$$

A testing process $\psi$ is a collection of random variables $(\psi(a, u) : a \in A, u \in U^k)$ 
taking values in {true, false}. In the case where $\psi$ assigns 'true' or 'false' with 
probability 1 (for each choice of $(a, u)$), we say that $\psi$ is a deterministic testing 
process. 



¹ Reconstruction is often viewed as a deterministic function, but in reality most methods have 
to break ties, and allowing $R$ to be random accommodates this (and perhaps other generalities). We 
will also always assume that reconstruction and generation are stochastically independent processes. 
² We will write $p_a(u)$ in the case of a simply-parameterized random variable. 

Let $\epsilon > 0$. We say that a testing process has accuracy $1 - \epsilon$ (for $k$ samples) if 
for all $a \in A$, and $\theta \in \Theta(a)$, the following two conditions hold: 

$$P(\psi(a, (X^1_{(a,\theta)}, \ldots, X^k_{(a,\theta)})) = \text{true}) \ge 1 - \epsilon, \tag{1}$$

and for any $b \neq a$, and $\theta' \in \Theta(b)$, 

$$P(\psi(a, (X^1_{(b,\theta')}, \ldots, X^k_{(b,\theta')})) = \text{false}) \ge 1 - \epsilon.$$



In other words, $\psi$ returns 'true' with probability at least $1 - \epsilon$ when the discrete 
parameter $a \in A$ is tested against the data it produced, and $\psi$ returns 'false' with 
probability at least $1 - \epsilon$ when any other particular element of $A$ is tested against 
the data. Note that any collection of random variables $X_{(a,\theta)}$ has a trivial testing 
process with accuracy $\frac{1}{2}$; namely for each $a \in A$ and $u \in U^k$ let $\psi(a, u) = \text{true}$ with 
probability $\frac{1}{2}$. To exclude such trivialities we will generally assume that $\epsilon < \frac{1}{2}$. 

If Inequality (1) is strengthened to: 

$$P(\psi(a, (X^1_{(a,\theta)}, \ldots, X^k_{(a,\theta)})) = \text{true}) = 1$$

for all $a \in A$ and $\theta \in \Theta(a)$, we say that the testing process has strong accuracy 
$1 - \epsilon$. 

The following proposition shows that reconstructing in general can require con- 
siderably longer sequences than testing. 

Proposition 2.1. 

(i) Suppose a simply-parameterized random variable $X_a$ ($a \in A$) takes values 
in $U$ and has a reconstruction method with accuracy strictly greater than $\frac{1}{2}$ 
for $k$ samples. Then $|A| < |U|^k$, and so $k > \frac{\log(|A|)}{\log(|U|)}$, or, equivalently: 

$$\log(|U|) > \frac{1}{k} \log(|A|).$$

(ii) For any $\epsilon > 0$ and $k = 1$, there exist sets $A, U$ and a simply-parameterized 
random variable $X_a$ (for $a \in A$) taking values in $U$, and a deterministic 
testing procedure $\psi$ that has strong accuracy of $1 - \epsilon$, such that: 

$$\log(|U|) = O(\log(\log(|A|))).$$



Proof. Part (i) was established in [H] (Theorem 2.1 (ii)). For Part (ii), let $U = 
\{1, \ldots, n\}$ and let $A$ be a collection of subsets of $U$ with the property that for any 
two distinct elements $a, a' \in A$: 

$$\frac{|a \cap a'|}{\min\{|a|, |a'|\}} < \epsilon. \tag{2}$$

Consider the following simply-parameterized random variable $X_a$ ($a \in A$) defined 
by the rule that $X_a$ selects one of the elements of $a$ uniformly at random. Note that 
$X_a$ takes values in the set $U$. Consider the deterministic testing process $\psi$ defined 
by the rule: 

$$\psi(a, u) = \begin{cases} \text{true}, & \text{if } u \in a; \\ \text{false}, & \text{otherwise.} \end{cases}$$

Clearly, 

$$P(\psi(a, X_a) = \text{true}) = 1,$$

and Condition (2) ensures that for $b \neq a$: 

$$P(\psi(a, X_b) = \text{true}) < \epsilon,$$

and so $\psi$ has strong accuracy $1 - \epsilon$. Thus it remains to construct a family $A$ of 
subsets of $\{1, \ldots, n\}$ satisfying (2) and of cardinality at least $e^{n^\eta}$ for some $\eta > 0$ 
(since in that case $\log(|U|) = O(\log(\log(|A|)))$). 
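
The membership test in this proof is simple enough to check directly. The sketch below is illustrative only: it uses a small hand-picked family of subsets rather than the large random family constructed in the proof, and the helper names are assumptions introduced here.

```python
from fractions import Fraction
from itertools import combinations


def psi(a: frozenset, u: int) -> bool:
    """Deterministic membership test: accept candidate set a iff the sample u lies in a."""
    return u in a


def false_positive_probability(a: frozenset, b: frozenset) -> Fraction:
    """P(psi(a, X_b) = true) when X_b is uniform on b, which equals |a & b| / |b|."""
    return Fraction(len(a & b), len(b))


def satisfies_condition_2(family, eps) -> bool:
    """Check |a & a'| / min(|a|, |a'|) < eps for every pair of distinct sets."""
    return all(
        Fraction(len(a & b), min(len(a), len(b))) < eps
        for a, b in combinations(family, 2)
    )


if __name__ == "__main__":
    family = [frozenset({1, 2, 3, 4}), frozenset({4, 5, 6, 7}), frozenset({7, 8, 9, 1})]
    print(satisfies_condition_2(family, Fraction(3, 10)))
    print(false_positive_probability(family[0], family[1]))
```

A sample drawn from the candidate set is always accepted, while a sample drawn from a different set is accepted with probability at most the overlap ratio, which is the quantity Condition (2) keeps below the error tolerance.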

The existence of such a large collection can be established by using the probabilistic 
method as follows. Construct $N$ random subsets of $\{1, \ldots, n\}$ by the following 
process: for each element i of {1, . . . ,n} place i in the set with probability n"^/"^; 
otherwise, leave that element out. Using standard results on the asymptotic distri- 
bution of sums of i.i.d. random variables, the probability p of the event that (i) all 
of the N sets are of size at least n" '^, and (ii) that all pairs of sets intersect in at 
most n"'^ points satisfies (by the subadditivity of probability): 



for positive constants $c_1, c_2$. Now, for $N = e^{n^\eta}$ where $0 < \eta < \min\{c_1, c_2\}$ it holds 
that $p > 0$ for sufficiently large values of $n$. In this case, a collection of $N$ sets must 
exist that satisfies Conditions (i) and (ii). Finally, this collection will also satisfy (2), 
provided $n$ is large enough that the ratio of the intersection bound in (ii) to the size 
bound in (i) is less than $\epsilon$. This completes the proof. □ 

Given a reconstruction method $R$, a canonical testing process $\psi_R$ is associated 
with it, defined as follows: 

$$\psi_R(a, u) = \text{true} \iff R(u) = a,$$

for all $a \in A$ and $u \in U^k$. The following lemma follows easily from the definitions. 

Lemma 2.2. Given a parameterized random variable $X_{(a,\theta)}$ and an associated 
reconstruction method $R$ with accuracy $1 - \epsilon$, the associated testing process $\psi_R$ has 
accuracy $1 - \epsilon$. 

In summary, it is clear that in general, 'testing' can require considerably less 
information (sequence length) than 'reconstructing.' We now consider what happens 
in a specific setting that arises in computational evolutionary biology. 

A phylogenetic (X-) tree is a tree $T$, whose leaf set $X$ is labelled and whose 
interior vertices are unlabelled and of degree at least 3. If, in addition, every interior 
vertex of $T$ has degree exactly 3 then $T$ is said to be binary. Without loss of 
generality, we can usually take $X = \{1, \ldots, n\}$. 

In this section, we specialize, letting $A = A_n$ be the set of binary phylogenetic 
trees on leaf set $\{1, \ldots, n\}$, and letting $U = U_n$ be the set of site patterns on the leaf 
set $\{1, \ldots, n\}$ generated under some Markov process on the tree. Let $k_t(n)$ denote 
the sequence length required to test a phylogenetic tree with high probability and 
let $k_r(n)$ be the sequence length required to reconstruct a tree with high probability 
(under the same model). 

For one class of models (finite-state Markov processes with an irreducible rate 
matrix), we will show (Theorem 3.1) that $k_t(n)$ grows at least logarithmically with 
$n$ (even in the simply-parameterized setting, where each tree has a fixed set of edge 
lengths). It had already been established earlier that $k_r(n)$ grows at least 
logarithmically for very general models of sequence evolution (and in some more 
restricted models and parameter sets, grows at least polynomially) [3, 4]. Thus, when 
$k_r(n) = O(\log n)$, we have $k_t(n) = \Theta(k_r(n))$. 

However, for a closely related Markov process - the 'random cluster model', which 
can be used to model rare genomic events - the situation is surprisingly different in 
one respect. Although reconstruction still requires at least a logarithmic (and in a 
certain range polynomial) number of samples [3], testing with strong accuracy of 
$1 - \epsilon$ can be achieved with $O(1)$ samples (Theorem 3.3). Thus, in this case we have 
$k_t(n) = o(k_r(n))$. Moreover this applies even in the simply-parameterized setting 
(where each tree has a fixed set of edge lengths). 

We will now describe these results, beginning with the finite-state model. 

3.1. Testing for a finite-state Markov process requires at least log(n) 
length sequences. Finite-state Markov processes on trees underlie many ap- 
proaches in molecular phylogenetics (see, for example, [5]). We provide a brief 
formal description; for more details, the reader may wish to consult [5] or 

A finite-state Markov process is a continuous-time Markov process whose state 
space is some finite set; we will denote the rate (intensity) matrix of this process 
by $R$. We assume that $R$ forms a reversible Markov process and we let $\pi$ denote 
the equilibrium distribution on the states (determined by $R$). 

Now, suppose we have a phylogenetic X-tree $T$ for which each edge $e$ has some 
strictly positive valued 'length' $l(e)$. In this way, we can define a Markov process 
on $T$ as follows (c.f. Fig. 1(a)). To some vertex $w$, we assign a state according 
to the distribution $\pi$. Then assign states to the remaining vertices of the tree by 
orienting the edges of $T$ away from $w$; for each arc $e = (u, v)$ for which $u$ has been 
assigned a state $s$, assign to $v$ the state obtained by applying the continuous-time 
Markov process with initial state $s$ for duration $l(e)$. Thus the transition 
matrix associated to edge $e$ is $\exp(R\,l(e))$ and the joint probability distribution 
on the vertices of $T$ is independent of the choice of the initial vertex $w$ (by the 
reversibility assumption). Such a model induces a marginal distribution on the set 
of site patterns - assignments of states to the elements of $X$ (the leaves of $T$) - and 
this constitutes a single sample of the process (the site pattern is a random variable 
parameterized by the pair $(T, l)$, where $l$ assigns the lengths to the edges of $T$). 
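
The generative process just described can be sketched for the simplest case. The code below assumes the 2-state symmetric model with rate matrix [[-r, r], [r, -r]], for which the probability of ending an edge of length l in a different state is (1 - exp(-2rl))/2 and the equilibrium distribution is uniform. The tree encoding (a dictionary of children and a dictionary of edge lengths) and all function names are illustrative assumptions.

```python
import math
import random


def change_probability(length: float, rate: float = 1.0) -> float:
    """P(state at the end of an edge differs from its start), 2-state symmetric model."""
    return 0.5 * (1.0 - math.exp(-2.0 * rate * length))


def simulate_site(children, lengths, root, leaves, rng):
    """Generate one site pattern by evolving a random root state away from the root.

    children: dict mapping a vertex to the list of its child vertices
    lengths:  dict mapping an arc (u, v) to its strictly positive length
    """
    state = {root: rng.randrange(2)}  # uniform equilibrium distribution
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            flipped = rng.random() < change_probability(lengths[(u, v)])
            state[v] = state[u] ^ int(flipped)
            stack.append(v)
    return tuple(state[x] for x in leaves)


if __name__ == "__main__":
    children = {"w": [1, "x"], "x": [2, 3]}
    lengths = {("w", 1): 0.1, ("w", "x"): 0.1, ("x", 2): 0.1, ("x", 3): 0.1}
    rng = random.Random(0)
    sites = [simulate_site(children, lengths, "w", [1, 2, 3], rng) for _ in range(5)]
    print(sites)
```

Calling `simulate_site` k times with the same tree and lengths gives the k i.i.d. site patterns that serve as data for reconstruction or testing.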

The main result of this section is the following. Recall that a rate matrix is 
irreducible if the probability of a transition from any one state to any other state 
in time $\delta > 0$ is strictly positive. 

Figure 1. (a): In a finite-state Markov process, a random state 
($\alpha$) at some vertex ($w$) evolves along the arcs of the tree (directed 
away from $w$). This gives rise to states at the internal vertices 
of the tree (in this example ($\alpha$), ($\alpha$), ($\beta$)) and a site pattern at 
the leaves (in this case $\alpha, \beta, \gamma, \beta, \gamma$ where the ordering corresponds 
to leaf order 1,2,3,4,5). (b): In the random cluster model each 
edge $e$ is independently cut (indicated by x) with an associated 
probability $p(e)$. In this example two edges were cut resulting in 
the leaf partition (character) of {{1}, {2, 3}, {4, 5}}. 

Theorem 3.1. Suppose we have a finite-state Markov process, with an irreducible 
rate matrix, on binary phylogenetic trees with leaf set $\{1, \ldots, n\}$. Suppose that 
we generate $k$ i.i.d. site patterns. Then any testing procedure that has accuracy 
$> \frac{1}{2}$ requires $k$ to grow at least at the rate $\log(n)$, even in the simply-parameterized 
setting where all edge lengths are equal to a fixed strictly positive value. 

Remark: Theorem 3.1 should be viewed alongside the result of [3], which shows 
that tree reconstruction (under the 2-state symmetric Markov model) with high 
accuracy is possible for sequences of length order $\log(n)$, even when the edge lengths 
are not known but constrained to lie within a fixed interval $[f, g]$ for any $f > 0$ and 
when $g$ is sufficiently small. 

To establish Theorem 3.1, we first require a general result. 

Lemma 3.2. Suppose a simply-parameterized random variable $X_a$ has a testing 
process with accuracy $1 - \epsilon$ for $k$ samples. Then: 

(i) $d^{(k)}(a, a') \ge 2(1 - 2\epsilon)$ for all $a, a' \in A$ with $a \neq a'$, where 

$$d^{(k)}(a, a') := \sum_{u \in U^k} |p_a(u) - p_{a'}(u)|.$$

(ii) Let $A'$ be a proper, nonempty subset of $A$, and let $a \in A - A'$. Consider the 
following random variable $X'$ that is simply parameterized by the set $\{a, *\}$, 
and defined as follows: $X'_a = X_a$, and $X'_* = X_Y$ where $Y$ is selected 
uniformly at random from the nonempty set $A'$. Then $X'$ has a reconstruction 
method with accuracy $1 - \epsilon$ for $k$ samples. In particular: 

$$d^{(k)}(a, *) \ge 2(1 - 2\epsilon).$$

Proof. Part (i). First observe that $d^{(k)}(a, a')$ is twice the variational distance 
between the probability distributions $p_a$ and $p_{a'}$ on $U^k$, i.e.: 

$$d^{(k)}(a, a') = 2 \cdot \max_E |P_a(E) - P_{a'}(E)|,$$

where maximization is over all events $E$ on $U^k$. For each $a, a' \in A$, let $E_{a,a'}$ 
be the event that $\psi(a, (X^1_{a'}, \ldots, X^k_{a'})) = \text{true}$. Then $E_{a,a'}$ has probability at 
least $1 - \epsilon$ when $a = a'$ and probability at most $\epsilon$ when $a \neq a'$. Consequently, 

$$d^{(k)}(a, a') \ge 2 |P_a(E_{a,a'}) - P_{a'}(E_{a,a'})| \ge 2(1 - 2\epsilon).$$



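Part (ii). Let $R : U^k \to \{a, *\}$ be defined as follows: 

$$R(u) = \begin{cases} a, & \text{if } \psi(a, u) = \text{true}; \\ *, & \text{if } \psi(a, u) = \text{false}. \end{cases}$$

Then, $\rho^k_a = P(\psi(a, (X^1_a, \ldots, X^k_a)) = \text{true}) \ge 1 - \epsilon$. Moreover: 

$$\rho^k_* = 1 - P(\psi(a, (X^1_Y, \ldots, X^k_Y)) = \text{true}),$$

and: 

$$P(\psi(a, (X^1_Y, \ldots, X^k_Y)) = \text{true}) = \frac{1}{|A'|} \sum_{b \in A'} P(\psi(a, (X^1_b, \ldots, X^k_b)) = \text{true}).$$

By assumption, each term in the sum is $\le \epsilon$ and so $P(\psi(a, (X^1_Y, \ldots, X^k_Y)) = 
\text{true}) \le \epsilon$. Thus, $\rho^k_* \ge 1 - \epsilon$, as required. By Lemma 2.2 there is a testing procedure 
for $X'$ that has accuracy at least $1 - \epsilon$ and so the second claim in Part (ii) now 
follows by Part (i). □ 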
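
Part (i) of Lemma 3.2 can be read as a lower bound on the error of any test: rearranging the inequality gives an error of at least 1/2 minus a quarter of the distance between the two k-sample distributions. The sketch below evaluates that distance by brute-force enumeration for two small distributions. The example distributions and function names are assumptions chosen for illustration, and the enumeration is exponential in k.

```python
from itertools import product
from math import prod


def d_k(p_a: dict, p_b: dict, k: int) -> float:
    """l1 distance between the k-fold product distributions of p_a and p_b."""
    states = sorted(set(p_a) | set(p_b))
    return sum(
        abs(prod(p_a.get(x, 0.0) for x in u) - prod(p_b.get(x, 0.0) for x in u))
        for u in product(states, repeat=k)
    )


def error_lower_bound(p_a: dict, p_b: dict, k: int) -> float:
    """Smallest eps compatible with d_k >= 2 * (1 - 2 * eps)."""
    return max(0.0, 0.5 - d_k(p_a, p_b, k) / 4.0)


if __name__ == "__main__":
    p_a = {0: 0.6, 1: 0.4}
    p_b = {0: 0.4, 1: 0.6}
    for k in (1, 3, 5, 9):
        print(k, round(d_k(p_a, p_b, k), 4), round(error_lower_bound(p_a, p_b, k), 4))
```

As k grows the distance approaches 2 and the bound on the error approaches 0, which is the mechanism the proof of Theorem 3.1 turns around: if the distance stays small, the error cannot be small.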




Figure 2. The generating tree in the proof of Theorem 3.1 (leaf labels visible in the figure include 3, 6, ..., 3m, 3m + 1 and 3m + 2). 

Proof of Theorem 3.1. Let $a$ be the tree shown in Fig. 2 (for convenience, we will take 
all the edge lengths in this tree to be equal). With a view to applying Lemma 3.2, 
let $A'$ denote the set of $m$ trees consisting of precisely those trees obtained from 
$a$ by interchanging the leaf labels $3i - 1$ and $3i$ for one value of $i = 1, \ldots, m$ (and 
keeping the edge lengths fixed). Suppose there exists a testing process of accuracy 
$1 - \epsilon$ for phylogenetic trees. Then, by Lemma 3.2(ii) we have: 

$$d^{(k)}(a, *) \ge 2(1 - 2\epsilon). \tag{3}$$

Suppose we generate $k$ sites i.i.d. under parameter $a$ or $*$ (in the case of $*$ we select 
the random element of $A'$ and then generate $k$ sites i.i.d. using that element). Let 
$C$ be the random vector variable that lists the sequences occurring at the unlabelled 
internal vertices of the model tree $a$ in Fig. 2. Note that $C$ has the same probability 
distribution for any element $b$ in $A'$ as it does for $a$, and so, in particular, we have: 

$$P_a(C = c) = P_*(C = c)$$

for all choices of $c$. Thus: 

$$d^{(k)}(a, *) = \sum_{u \in U^k} |P_a(u) - P_*(u)| = \sum_{u} \Big| \sum_{c} (P_a(u|c) - P_*(u|c)) \cdot P_a(C = c) \Big|,$$

and so: 

$$d^{(k)}(a, *) \le \sum_{c} \Big( \max_{c'} \sum_{u} |P_a(u|c') - P_*(u|c')| \Big) P_a(C = c) = \max_{c} \sum_{u} |P_a(u|c) - P_*(u|c)|. \tag{4}$$

We will establish the following crucial inequality: For any $c$: 

$$\sum_{u} |P_a(u|c) - P_*(u|c)| \le \frac{\tau^k}{\sqrt{m}} \tag{5}$$

for a constant $\tau$. Then, combining (5), (4) and (3) gives $\frac{\tau^k}{\sqrt{m}} \ge 2(1 - 2\epsilon)$ and so 
$k$ must grow at least logarithmically with $n$ ($= 3m + 2$), as claimed by the Theorem. 

Thus, to establish the theorem, it suffices to justify Inequality (5). 
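
The step from the combined inequality to logarithmic growth can be written out explicitly. The short derivation below is a sketch under two added assumptions: the constant in Inequality (5), written $\tau$ here, exceeds 1, and $\epsilon < \frac{1}{2}$.

```latex
\begin{align*}
\frac{\tau^{k}}{\sqrt{m}} \ge 2(1-2\epsilon)
&\iff k \log \tau \ge \tfrac{1}{2}\log m + \log\bigl(2(1-2\epsilon)\bigr) \\
&\iff k \ge \frac{\tfrac{1}{2}\log m + \log\bigl(2(1-2\epsilon)\bigr)}{\log \tau}.
\end{align*}
% Since n = 3m + 2, we have log m = log((n - 2)/3) = log n + O(1),
% so the right-hand side grows like (log n) / (2 log tau).
```

For fixed $\epsilon$ and $\tau$, the right-hand side is of order $\log n$, which is the growth rate stated in Theorem 3.1.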

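For any $c$ maximizing $\sum_u |P_a(u|c) - P_*(u|c)|$, let us denote $P_x(u|c)$ by $Q_x(u)$ 
for $x = a, *$ and for $x = b \in A'$. Then: 

$$\sum_{u} |P_a(u|c) - P_*(u|c)| = \sum_{u} |Q_a(u) - Q_*(u)| = \sum_{u} \Big| Q_a(u) - \frac{1}{m} \sum_{b \in A'} Q_b(u) \Big|. \tag{6}$$

We can rewrite the expression on the right-hand side of (6) as: 

$$\sum_{u} Q_a(u) \Big| 1 - \frac{1}{m} \sum_{b \in A'} \frac{Q_b(u)}{Q_a(u)} \Big| = \sum_{u} Q_a(u) \Big| \frac{1}{m} \sum_{b \in A'} \Big( 1 - \frac{Q_b(u)}{Q_a(u)} \Big) \Big|. \tag{7}$$

In particular, consider the following random variable: 

$$Z_b := 1 - \frac{Q_b(\xi)}{Q_a(\xi)}$$

for each $b \in A'$, where $\xi$ is the random element of $U^k$ generated by the probability 
distribution $Q_a$. Then the expression on the right-hand side of (7) can be rewritten 
as: 

$$E\Big( \Big| \frac{1}{m} \sum_{b \in A'} Z_b \Big| \Big),$$

where expectation is taken with respect to the probability distribution $Q_a$ on $U^k$. 
Now, for all $b \in A'$ we have: 

$$E(Z_b) = \sum_{u} Q_a(u) \cdot \Big[ 1 - \frac{Q_b(u)}{Q_a(u)} \Big] = \sum_{u} (Q_a(u) - Q_b(u)) = 1 - 1 = 0,$$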



IS TESTING A TREE EASIER? 






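and the $Z_b$ are independent random variables (since we have conditioned on a 
particular value $C = c$). Thus, by Jensen's inequality: 

$$E\Big( \Big| \frac{1}{m} \sum_{b \in A'} Z_b \Big| \Big) \le \sqrt{ E\Big( \Big( \frac{1}{m} \sum_{b \in A'} Z_b \Big)^2 \Big) } = \sqrt{ \frac{1}{m^2} \sum_{b \in A'} E(Z_b^2) }. \tag{8}$$

Now, since $E(Z_b) = 0$, we have $1 + E(Z_b^2) = E((1 - Z_b)^2) = E\big( (Q_b(\xi)/Q_a(\xi))^2 \big)$ and so: 

$$E(Z_b^2) = -1 + E\Big( \Big( \frac{Q_b(\xi)}{Q_a(\xi)} \Big)^2 \Big). \tag{9}$$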



Also, writing $u = (u_1, \ldots, u_k)$ and $c = (c_1, \ldots, c_k)$, the irreducibility condition of 
the Markov process ensures that the following inequality holds: 

$$\frac{P_b(u_i \mid c_i)}{P_a(u_i \mid c_i)} \le \tau \tag{10}$$

for some absolute constant $\tau$ dependent only on the rate matrix $R$ and the (equal) 
value of the common edge length. Thus, by independence, $Q_b(u)/Q_a(u) \le \tau^k$, and so (9) 
gives: 

$$E(Z_b^2) \le \tau^{2k}.$$

Consequently, by (8): 

$$E\Big( \Big| \frac{1}{m} \sum_{b \in A'} Z_b \Big| \Big) \le \sqrt{ \frac{1}{m^2} \cdot |A'| \cdot \tau^{2k} } = \frac{\tau^k}{\sqrt{m}},$$

and so (by (6) and (7)) we have verified Inequality (5), and thereby completed the 
proof. □ 



3.2. An O(1) test for the random cluster model. In this section, a character 
(on $X$) denotes an arbitrary partition $\{\alpha_1, \ldots, \alpha_m\}$ of $X$ into any number of disjoint 
subsets. 

In the random cluster model one has a phylogenetic X-tree $T$ and each edge 
$e$ has an associated probability $p(e)$ that the edge of $T$ is cut. These cuts are 
performed independently across the tree, resulting in a (generally disconnected) 
graph, and the leaves in each connected component form the blocks of a resulting 
random character on $X$. Thus $T$ along with the $p(e)$ values (called 'substitution 
probabilities') provides a well-defined probability distribution on characters on $X$ 
(see Fig. 1(b)). 
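
The edge-cutting process can be sketched directly. The code below is an illustration under assumptions made here: the tree is given as a list of edges, each edge carries its own cut probability, and connected components are tracked with a small union-find structure. None of these representation choices comes from the original text.

```python
import random


def random_cluster_character(edges, leaves, p, rng):
    """Cut each edge e independently with probability p[e]; return the blocks of leaves."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for e in edges:
        u, v = e
        root_u, root_v = find(u), find(v)
        if rng.random() >= p[e]:  # the edge survives with probability 1 - p[e]
            parent[root_u] = root_v

    blocks = {}
    for leaf in leaves:
        blocks.setdefault(find(leaf), set()).add(leaf)
    return {frozenset(block) for block in blocks.values()}


if __name__ == "__main__":
    edges = [(1, "u"), (2, "u"), ("u", "v"), (3, "v"), (4, "v")]
    p = {e: 0.2 for e in edges}
    rng = random.Random(0)
    print(random_cluster_character(edges, [1, 2, 3, 4], p, rng))
```

Each call returns one random character, that is, a partition of the leaf set into the leaf sets of the surviving connected components.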

This is the same distribution on characters as one obtains in the limit as $s \to \infty$ 
from a finite-state Markov process on $T$ that has an $s \times s$ rate matrix in which 
all its off-diagonal entries are equal, and where one considers the character on X 
whose blocks are the sets of leaves of the same state. Thus we can view the random 
cluster model as a type of infinite-state Markov process. The model is relevant 
for describing evolution in settings where transitions generally lead to states that 
have not appeared elsewhere in the tree (such as with gene order re-arrangement, 
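or other rare genomic events). 

Given a character $\chi$ on $X$ and a phylogenetic X-tree $T$, let $T_\alpha$ denote the 
minimal subtree of $T$ connecting the leaves of $T$ in block $\alpha$. Then $\chi$ is said to be 
homoplasy-free on $T$ if the collection of trees $\{T_\alpha : \alpha \in \chi\}$ is vertex-disjoint. Given 
a sequence $\mathcal{C} = (\chi_1, \chi_2, \ldots, \chi_k)$ of characters, consider the following deterministic 
testing process $\psi_H$ on phylogenetic trees: 

$$\psi_H(\mathcal{C}, T) = \begin{cases} \text{true}, & \text{if } \chi_i \text{ is homoplasy-free on } T \text{ for all } i \in \{1, \ldots, k\}; \\ \text{false}, & \text{otherwise.} \end{cases}$$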
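
A homoplasy-free check of this kind can be sketched as follows: compute, for each block, the vertex set of the minimal subtree connecting its leaves, and verify that these vertex sets are pairwise disjoint. The adjacency-dictionary tree encoding, the pruning strategy and the function names are illustrative assumptions rather than the authors' implementation.

```python
from collections import deque
from itertools import combinations


def minimal_subtree(adj, block):
    """Vertex set of the minimal subtree of the tree adj that connects the leaves in block."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = set(adj)
    # Repeatedly prune degree-<=1 vertices that are not in the block.
    queue = deque(v for v in adj if degree[v] <= 1 and v not in block)
    while queue:
        v = queue.popleft()
        if v not in alive:
            continue
        alive.discard(v)
        for w in adj[v]:
            if w in alive:
                degree[w] -= 1
                if degree[w] <= 1 and w not in block:
                    queue.append(w)
    return alive


def is_homoplasy_free(adj, character):
    """True when the minimal subtrees of the blocks are pairwise vertex-disjoint."""
    subtrees = [minimal_subtree(adj, set(block)) for block in character]
    return all(s.isdisjoint(t) for s, t in combinations(subtrees, 2))


def homoplasy_test(characters, adj):
    """Accept the candidate tree only if every character is homoplasy-free on it."""
    return all(is_homoplasy_free(adj, chi) for chi in characters)


if __name__ == "__main__":
    # Quartet tree with cherries {1, 2} and {3, 4}.
    adj = {1: ["u"], 2: ["u"], "u": [1, 2, "v"], "v": ["u", 3, 4], 3: ["v"], 4: ["v"]}
    print(homoplasy_test([[{1, 2}, {3, 4}]], adj))
    print(homoplasy_test([[{1, 3}, {2, 4}]], adj))
```

On the quartet tree in the usage example, the character with blocks {1, 2} and {3, 4} is accepted, while the character with blocks {1, 3} and {2, 4} forces both subtrees through the central edge and is rejected.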




Figure 3. Canonical representation of two different binary 
phylogenetic trees in the proof of Theorem 3.3. 

Theorem 3.3. Under the random cluster model on binary phylogenetic trees with 
leaf set $\{1, \ldots, n\}$, suppose that we generate $k$ i.i.d. characters, where the 
substitution probability $p(e)$ on any edge $e$ lies in some fixed interval $[a, b]$ where 
$0 < a < b < \frac{1}{2}$. Then $\psi_H$ is a testing process with strong accuracy of $1 - \epsilon$ 
whenever the number of characters is at least $k$, where: 

$$k = \frac{1}{\gamma} \cdot \log(\epsilon^{-1}),$$

and where $\gamma = a \cdot \left( \frac{1 - 2b}{1 - b} \right)^4$. This holds for all values of $n$. 
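
To get a feel for the size of this bound, the following sketch evaluates the constant and the resulting number of characters, reading the bound as the product of the four-fold factor (1 - 2b)/(1 - b) and the lower endpoint a, as used in the proof of the theorem. The function names and the sample parameter values are illustrative assumptions.

```python
import math


def gamma(a: float, b: float) -> float:
    """Lower bound on the per-character probability of exposing a wrong tree."""
    if not (0.0 < a < b < 0.5):
        raise ValueError("require 0 < a < b < 1/2")
    return a * ((1.0 - 2.0 * b) / (1.0 - b)) ** 4


def characters_needed(a: float, b: float, eps: float) -> int:
    """Smallest integer k with k >= log(1 / eps) / gamma(a, b)."""
    return math.ceil(math.log(1.0 / eps) / gamma(a, b))


if __name__ == "__main__":
    a, b, eps = 0.1, 0.3, 0.05
    k = characters_needed(a, b, eps)
    detection = 1.0 - (1.0 - gamma(a, b)) ** k
    print(k, detection >= 1.0 - eps)
```

The number of leaves does not appear anywhere in the computation, which is the sense in which the test needs only O(1) characters.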

To prove Theorem 3.3, we first require a combinatorial lemma. 

Lemma 3.4. Suppose $T_1, T_2$ are two binary phylogenetic X-trees, with $T_1 \neq T_2$, 
and $|X| > 3$. There exist induced rooted phylogenetic subtrees $T_A, T_B$, on leaf sets 
$A$ and $B$, respectively, where $A, B$ are disjoint, nonempty subsets of $X$, such that: 

(i) $T_A$ and $T_B$ are present as pendant subtrees in $T_1$ and $T_2$; and 
(ii) the roots of $T_A$ and of $T_B$ are adjacent to a common vertex in $T_2$ but not in 
$T_1$. 

Proof. The proof is by induction on $|X|$. For $|X| = 4$, the result is easily seen to 
hold. Suppose the lemma holds for $|X| = n - 1$ where $n \ge 5$, and that $|X| = n$. If $T_2$ 
has a cherry (a pair of elements $\{x, y\}$ of $X$ that label leaves that are adjacent to a 
common vertex) that is not also a cherry of $T_1$, then we can take $A = \{x\}$, $B = \{y\}$ 
and the claim in the lemma holds. Otherwise, every cherry of $T_2$ is a cherry of $T_1$ 
and since $T_2$ has at least one cherry (say $\{x, y\}$) we may replace $T_1, T_2$ with the pair 
of trees $T_1', T_2'$ obtained by deleting the leaves labelled $x, y$ and their incident edges 
from each tree, and assigning each newly created leaf vertex the label $v_{x,y}$. Note 
that $T_1', T_2'$ are binary phylogenetic trees, with leaf label set $(X - \{x, y\}) \cup \{v_{x,y}\}$, 
and that $T_1' \neq T_2'$ (otherwise it is easily seen that $T_1 = T_2$). Thus we may apply the 
induction hypothesis to the pair $T_1', T_2'$. Given the sets $A, B$ for this pair that meet 
the requirements stated in the lemma, we then replace any occurrence of $v_{x,y}$ in $A$ 
or $B$ by the elements $x, y$ - the resulting pair of sets now satisfies the requirements 
stated in the lemma. This completes the proof. □ 

Proof of Theorem 3.3. Suppose that $T_1$ is the binary phylogenetic X-tree that 
generates the characters. Then $\psi_H(\mathcal{C}, T_1)$ = 'true' with probability 1, since the event that 
$\chi$ is homoplasy-free on $T_1$ has probability 1 for any character $\chi$ that evolves on $T_1$ 
under the random cluster model. Now, suppose that $T_2$ is a binary phylogenetic 
X-tree that is different from $T_1$. By Lemma 3.4, $T_2$ and $T_1$ both share the same 
pair of pendant subtrees $T_A$ and $T_B$ for which the roots are adjacent to a common 
vertex in $T_2$ but not in $T_1$, as illustrated in Fig. 3. Now consider the evolution of one 
of the characters $\chi_i$ under the random cluster model on $T_1$. Let $\alpha, \beta$ denote, 
respectively, the blocks of the character at vertices $u, v$ of $T_1$, and consider the 
conjunctive event $E = \bigcap_{i=1}^{5} E_i$ in which: 

($E_1$) $\alpha \neq \beta$; 
($E_2$) at least one leaf of $T_A$ is present in $\alpha$; 
($E_3$) at least one leaf of the other subtree of $T_1$, which is incident with $u$ and 
does not contain $v$, is present in $\alpha$; 
($E_4$) at least one leaf of $T_B$ is present in $\beta$; and 
($E_5$) at least one leaf of the other subtree of $T_1$, which is incident with $v$ and 
does not contain $u$, is present in $\beta$. 

Under the random cluster model, these five events are independent (by the 
assumption that the cuts on edges are made independently) and so: 

$$P(E) = P\Big( \bigcap_{i=1}^{5} E_i \Big) = \prod_{i=1}^{5} P(E_i) \ge \gamma,$$

since $P(E_1) \ge a$, and, by Lemma 2.1 of [7], $P(E_i) \ge (1 - 2b)/(1 - b)$ for $i \in \{2, \ldots, 5\}$. 
Now, $\psi_H(\mathcal{C}, T_2)$ = 'false' whenever event $E$ occurs. If we evolve $k$ independent 
characters under the assumptions stated in the Theorem, then the probability that 
$E$ occurs at least once is at least $1 - (1 - \gamma)^k$, and this probability is at least $1 - \epsilon$ 
when $k \ge \gamma^{-1} \log(\epsilon^{-1})$ by virtue of the inequality $1 - (1 - x)^k \ge 1 - e^{-xk}$. This 
completes the proof. □ 

4. Concluding comments 

The reader may be curious as to where our proof for the log(n) lower bound 
on sequence length for testing under the finite-state model breaks down for the 
random cluster model. The crucial distinction is that the random cluster model 
fails to satisfy condition (10) required in the proof of Theorem 3.1. That is, in the 
random cluster model, some characters have a positive probability on some trees but 
have zero probability on other trees. Indeed it has been shown that, for any binary 
phylogenetic tree T with n leaves, there is a set of just four characters such that T is 
the only tree for which these characters have strictly positive probability [8] . Thus, 
in contrast to finite-state models, under the random cluster model each tree can be 
reconstructed from $O(1)$ characters (using, say, maximum likelihood estimation), 
provided these characters are carefully selected; if the characters evolve under the 
random cluster model then, as mentioned earlier, the required number of 
characters for accurate tree reconstruction grows at the rate of (at least) log(n) [7]. 

Note also that Theorem 3.3 can be extended to a setting in which the substitution 
probabilities vary from character to character, provided they all lie in some fixed 
interval $[a, b]$ where $0 < a < b < \frac{1}{2}$. If we generate $k$ characters independently (but 
not necessarily identically) in this more general setting, testing the true tree using 
$\psi_H$ will return 'true' with probability 1, while testing any other tree will return 
'false' with probability at least $1 - \epsilon$ provided $k$ satisfies the lower bound described 
in Theorem 3.3. 

Although testing for the finite-state Markov model can require the same $\Omega(\log(n))$ 
growth in sequence length required for reconstructing, there is a related task where 
$O(1)$ sequence length suffices. This is for teasing a tree, where one is given sequences 
of length $k$ and a set of two trees, one of which is the tree that generated the data, 
and one is asked to identify which of the two trees generated the data. For the 
symmetric 2-state Markov process and under suitable restrictions on the substitution 
probabilities, it was shown in [15] that sequences of length $O(1)$ (i.e. independent 
of $n$) suffice to correctly solve (with high probability) the teasing problem on binary 
phylogenetic trees with $n$ leaves. 

Finally, we have considered reconstruction only in the part of the parameter 
range (on the substitution probabilities on the edges of the tree) where reconstruc- 
tion requires logarithmic length sequences. Outside of this region, it is known that 
polynomial-length sequences can be required, both for the finite state Markov model 
[3] and for the random cluster model [7]. It may be of interest in future work to 
determine the sequence length required for testing in these portions of parameter 
space. 